Crawling Back and Forth: Using Back and Out Links to Locate Bilingual Sites

Authors

  • Luciano Barbosa
  • Srinivas Bangalore
  • Vivek Kumar Rangarajan Sridhar
Abstract

This paper presents a novel crawling strategy to locate bilingual sites. It does so by focusing on the Web graph neighborhood of these sites and exploring the patterns of the links in this region to guide its visitation policy. A sub-task of bilingual site discovery is detecting bilingual sites: given a Web site, determine whether it is bilingual. We perform this task by combining supervised learning and language identification. Experimental results demonstrate that our crawler outperforms previous crawling approaches and produces a high-quality collection of bilingual sites, which we evaluate in the context of machine translation in the tourism and hospitality domain. The parallel text obtained using this crawling strategy yields a relative improvement of 22% in BLEU score (English-to-Spanish) over an out-of-domain seed translation model trained on the European parliamentary proceedings.
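The abstract does not spell out the visitation policy, so the following is only a minimal sketch of one way a back-and-out-link traversal could work, not the authors' implementation. It assumes an in-memory link graph (`out_links`), simulates backlink lookup by inverting that graph, and takes a caller-supplied `is_bilingual` predicate standing in for the paper's detector (supervised learning plus language identification). All names, including `crawl_back_and_forth`, are hypothetical.

```python
from collections import deque


def crawl_back_and_forth(seeds, out_links, is_bilingual, max_visits=100):
    """Toy traversal: from each bilingual site, step back to the pages
    linking to it, then step forward to the other sites those pages link to.

    out_links    -- dict mapping a page/site to the list of sites it links to
                    (hypothetical stand-in for real fetched pages)
    is_bilingual -- callable(site) -> bool; placeholder for a real detector
    """
    # Invert the out-link graph to simulate a backlink lookup service.
    back_links = {}
    for src, targets in out_links.items():
        for dst in targets:
            back_links.setdefault(dst, set()).add(src)

    frontier = deque(seeds)
    visited = set()
    found = []
    while frontier and len(visited) < max_visits:
        site = frontier.popleft()
        if site in visited:
            continue
        visited.add(site)
        if not is_bilingual(site):
            # Do not expand the neighborhood of non-bilingual sites.
            continue
        found.append(site)
        # Back step: pages that point at this bilingual site.
        for hub in sorted(back_links.get(site, ())):
            # Out step: other sites those pages point at.
            for candidate in sorted(out_links.get(hub, ())):
                if candidate not in visited:
                    frontier.append(candidate)
    return found


if __name__ == "__main__":
    # Hypothetical toy graph; the site names are illustrative only.
    graph = {
        "hub1": ["a.com", "b.com", "c.com"],
        "hub2": ["b.com", "d.com"],
        "e.com": ["x.com"],
    }
    bilingual = {"a.com", "b.com", "d.com"}
    print(crawl_back_and_forth(["a.com"], graph, bilingual.__contains__))
    # prints ['a.com', 'b.com', 'd.com']
```

In this sketch only sites judged bilingual have their neighborhood expanded, which keeps the traversal inside the region of the graph around known bilingual sites; a real crawler would also need fetching, politeness rules, and an actual backlink source.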

Similar resources

A Scalable Approach to Building a Parallel Corpus from the Web

Parallel text acquisition from the Web is an attractive way to augment statistical models (e.g., machine translation, cross-lingual document retrieval) with domain-representative data. The basis for obtaining such data is a collection of pairs of bilingual Web sites or pages. In this work, we propose a crawling strategy that locates bilingual Web sites by constraining the visitation policy o...

Validation of back-calculation methods using otoliths to determine the length of anchovy kilka (Clupeonella engrauliformis)

Age structure of the Caspian Sea anchovy kilka, Clupeonella engrauliformis, was estimated for the first time by back-calculation methods. Otolith growth and the rate of increment in anchovy kilka were examined to determine whether otoliths could be used to back-calculate body sizes at various life stages. Sampling was carried out on board commercial fishing vessels along the Iranian coast in 20...


Workload-Aware Web Crawling and Server Workload Detection

With the development of search engines, more and more web crawlers are used to gather web pages. The rising crawling traffic has raised the concern that crawlers may impact web sites. On the other hand, a more efficient crawling strategy is required for the coverage and freshness of the search engine index. In this paper, crawlers of several major search engines are analyzed using one six-months acc...

Prevalence of low back pain and its related factors among pre-hospital emergency personnel in Iran

Objective: Low back pain is one of the most important occupational injuries among emergency medical personnel. This study was carried out to investigate the prevalence of low back pain as well as its physical, mental and managerial predisposing factors among emergency medical personnel in Iran. Methods: In this analytical cross-sectional study we recruited 298 pre-hospital emergency medical personn...


Publication date: 2011